Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/BUG: .apply will correctly infer output shape when axis=1 #18577

Merged
merged 12 commits into from
Feb 7, 2018

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Nov 30, 2017

closes #16353
closes #17348
closes #17437
closes #18573
closes #17970
closes #17892
closes #17602
closes #15628
closes #18775
closes #18901
closes #18919

This fixes apply to work correctly when the returned shape mismatches the original. It will try to set the indices if it possible. Setting to a list-like with axis=1 is now disallowed (but still possible if you operate row-wise). We were applying this inconsitently. This is of course a discouraged practice anyhow.

Prob should add some examples / update the doc-string a bit.

@jreback jreback added API Design Apply Apply, Aggregate, Transform, Map Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 30, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 30, 2017
@codecov
Copy link

codecov bot commented Nov 30, 2017

Codecov Report

Merging #18577 into master will decrease coverage by 0.02%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage   91.45%   91.43%   -0.03%     
==========================================
  Files         157      157              
  Lines       51378    51392      +14     
==========================================
+ Hits        46987    46988       +1     
- Misses       4391     4404      +13
Flag Coverage Δ
#multiple 89.29% <80%> (-0.01%) ⬇️
#single 40.56% <36%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.63% <80%> (-0.28%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c40c8f8...dce186a. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 30, 2017

Codecov Report

Merging #18577 into master will decrease coverage by <.01%.
The diff coverage is 94.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage    91.6%    91.6%   -0.01%     
==========================================
  Files         150      150              
  Lines       48750    48793      +43     
==========================================
+ Hits        44656    44695      +39     
- Misses       4094     4098       +4
Flag Coverage Δ
#multiple 89.97% <94.21%> (ø) ⬆️
#single 41.74% <47.93%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/style.py 96.22% <100%> (ø) ⬆️
pandas/core/frame.py 97.43% <100%> (ø) ⬆️
pandas/core/sparse/frame.py 94.82% <100%> (-0.03%) ⬇️
pandas/core/apply.py 96.77% <93.75%> (-2.66%) ⬇️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 983d71f...1d93380. Read the comment docs.

@jreback
Copy link
Contributor Author

jreback commented Dec 1, 2017

@jreback jreback force-pushed the apply branch 3 times, most recently from c369687 to f6f0371 Compare December 7, 2017 11:19
@jreback
Copy link
Contributor Author

jreback commented Dec 7, 2017

@TomAugspurger @jorisvandenbossche if you have a chance

@chris-b1
Copy link
Contributor

chris-b1 commented Dec 7, 2017

Unfortunately I'm guessing this will break someone's code in an subtle way, but current behavior is obviously bad, so seems like a net positive.

I've never used it so not sure I fully understand the itention, but this does seem to break the reduce keyword, docstring says "If reduce is True a Series will always be returned, and if False a DataFrame will always be returned."

In [23]: pd.__version__
Out[23]: '0.22.0.dev0+310.gf6f0371'

In [24]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[24]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [25]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[25]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3


In [92]: pd.__version__
Out[92]: '0.21.0'

In [93]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[93]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [94]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[94]: 
0    (1, 2, 3)
1    (1, 2, 3)
2    (1, 2, 3)
3    (1, 2, 3)
4    (1, 2, 3)
5    (1, 2, 3)
dtype: object

@jorisvandenbossche
Copy link
Member

@jreback I will try to review this tomorrow

@jreback
Copy link
Contributor Author

jreback commented Dec 7, 2017

i think i can simply deprecate reduce

let me see

@jreback
Copy link
Contributor Author

jreback commented Dec 8, 2017

reduce= is deprecated, but I think we should actually figure out the use case and fix it. IIRC its if the passed apply function returns an empty, what do you return (a Series or a DataFrame). but need to sort thru some issues. maybe I will remove this and we can try later. (its the last commit and orthogonal).

.. ipython:: python

df = pd.DataFrame([[1,2], [1,2]], columns=['a','b'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to display df here to show what it is prior to the apply, similar to what was done in the example above.

df.apply(lambda x: [1, 2, 3], axis=1)
df.apply(lambda x: [1, 2], axis=1)

The returned input will also *not* return a Series with the list-wrapper as previously.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use code formatting here: ``Series``

dtype: object


New Behavior. The behaviour is consistent.
Copy link
Member

@jschendel jschendel Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: The second "behaviour" is using British spelling, but the first isn't.

dtype: object


New Behaviour
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: British spelling

@@ -4818,6 +4818,8 @@ def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
while guessing, exceptions raised by func will be ignored). If
reduce is True a Series will always be returned, and if False a
DataFrame will always be returned.

.. deprecated:: 0.22.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be mentioned in the whatsnew?

@jreback
Copy link
Contributor Author

jreback commented Dec 10, 2017

I reverted the deprecation off reduce, I can deprecate but its complicated to actually replace it.

@jreback
Copy link
Contributor Author

jreback commented Dec 10, 2017

thanks for comments @jschendel

@jorisvandenbossche
Copy link
Member

I didn't look at any code yet, but some general comments from experimenting with the branch:

  • I am big +1 on making those things more consistent, but I think we should provide an alternative to achieve the original (in the cases where it changes), i.e. have a way to specify you want no inference on the result, just put the result as a scalar in a single cell.
    We should also add documentation that explains the rules here to make this more predictable.

  • Column names of the result: using the example of the whatsnew, there is still an inconsistency in the resulting column names:

    In [1]: df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])
    
    In [2]: df.apply(lambda x: [1, 2, 3], axis=1)
    Out[2]: 
       A  B  C
    0  1  2  3
    1  1  2  3
    2  1  2  3
    3  1  2  3
    
    In [3]: df.apply(lambda x: [1, 2], axis=1)
    Out[3]: 
       0  1
    0  1  2
    1  1  2
    2  1  2
    3  1  2
    

    So depending on the length of the list, it reuses the original column names or not. I would personally be in favor of fixing this inconsistency to not reuse the column names (although that would be an additional breaking change for the first case).
    Related, when returning a Series it no longer uses the explicit given names of the result, if the length of the result matches that of the original frame:

    In [10]: df.apply(lambda x: pd.Series([1, 2], index=['test', 'other']), 1)
    Out[10]: 
       test  other
    0     1      2
    1     1      2
    2     1      2
    3     1      2
    
    In [11]: df.apply(lambda x: pd.Series([1, 2, 3], index=['test', 'other', 'cols']), 1)
    Out[11]: 
       A  B  C
    0  1  2  3
    1  1  2  3
    2  1  2  3
    3  1  2  3
    

    So this last example ([11]) is a regression, as before it correctly returned the specified new names.

  • raw=True seems to change some of the inference, while, based on the explanation in the docstring, it should only change the values passed to the function and not the inference afterwards on the result of the function.

    In [24]: df.apply(lambda x: (1, 2), axis=1)
    Out[24]: 
       0  1
    0  1  2
    1  1  2
    2  1  2
    3  1  2
    
    In [25]: df.apply(lambda x: (1, 2), axis=1, raw=True)
    Out[25]: 
    0    (1, 2)
    1    (1, 2)
    2    (1, 2)
    3    (1, 2)
    dtype: object
    
  • Regarding the example of @chris-b1 above (API/BUG: .apply will correctly infer output shape when axis=1 #18577 (comment)), so for tuples this change is breaking the behaviour always (regardless of the length of the tuple). Before, tuples were always regarded as 'scalars', unless you specified reduce=False.
    I am not fully sure I like this change, and as said above I think we need an alternative if keeping to this change.

  • An illustration of where this "now always try to unpack iterable things like tuples into columns" would cause a problem in a real world use case:

    This branch:

    In [53]: from shapely.geometry import MultiLineString
    
    In [54]: def f(row):
        ...:     return MultiLineString([[(1,2), (3,4)], [(3,4), (4,2)]])
    
    In [55]: df.apply(f, axis=1)
    Out[55]: 
                           0                      1
    0  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    1  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    2  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    3  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    

    with 0.21.1 it preserves the returned objects in a single series (the desired result):

    In [94]: df.apply(f, axis=1)
    Out[94]: 
    0    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    1    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    2    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    3    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    4    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    5    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    dtype: object
    
  • I also noticed that the behaviour when returning arrays changed (to be consistent with lists, so probably the change is good), but just noticing it to make sure it is tested (previously it raised an error when the shape didn't match (eg lambda x: np.array([1, 2]) with the example above), now it behaves like a list).

@jorisvandenbossche
Copy link
Member

Contemplating my above comments, I think I would be in favor of (1) "no inference" for general iterable things (objects, tuples, ..), and (2) a clear way to enable / disable inference.

The switch to enable/disable inference could have a default depending on the result type, eg by default infer (so not putting the result as a scalar in a single cell) for Series objects and maybe arrays, and by default no inference for tuples, dicts, generic objects, and probably lists as well. When doing this, I think the main breakage is the case of a "list with length equal to number of columns of original frame", but in that case we could even catch this case and raise a FutureWarning.

Note I didn't look at all examples from the issues yet, so I might be missing some cases.

@jreback
Copy link
Contributor Author

jreback commented Dec 11, 2017

so we need some sort of a parameter to allow inference on the returned result. This is analgous to specifying an output dtype & (shape) in np.vectorize maybe something like:

output=infer|scalar|reduce (and can remove the reduce= & raw= kwargs as well

@jorisvandenbossche
Copy link
Member

(and can remove the reduce= & raw= kwargs as well

I think raw is still orthogonal (it is to determine which type if passed to the function, not related to what is returned by the function).
But broadcast could maybe be integrated.

@jreback
Copy link
Contributor Author

jreback commented Feb 6, 2018

If we decide to not broadcast a Series to the original layout, I think we should at least raise an error for that case, and not silently ignore the result_type keyword you passed.
But it is a breaking change compared to 0.22 (although, I suppose this is rather a corner case .. returning series and using broadcast=True)

this already raises (an incorrect shape whether the result is a list of Series), added a test. (matches 0.22.0)

@jorisvandenbossche
Copy link
Member

Yes, if it has an incorrect shape, then it already raises (just like lists). But my example has a correct length, only different index names.

Using your example above:

In [1]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=['A', 'B', 'C'])
   ...: 

In [2]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[2]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [3]: df.apply(lambda x: Series([1, 2, 3], index=list('abc')), axis=1, result_type='broadcast')
Out[3]: 
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

I think the above is not the desired result. IMO one of the goals of result_type='broadcast should be to ensure that you can be 100% sure what the result will look like (i.e. dataframe with the original column names).
I don't really care about this case, so raising an error instead of preserving the original column names when the returned Series has different index names (the current behaviour on master for broadcast=True) is fine for me as well. I just think we should not silently ignore result_type='broadcast'. If you want the result as shown above, that is what the default of result_type=None already does, as well as the explicit result_type='expand'.

@jreback
Copy link
Contributor Author

jreback commented Feb 7, 2018

you think that [3] should actually be [2], IOW we ignore if the resulting index (if any) on a broadcastable? ok I think that's fine.

@jreback
Copy link
Contributor Author

jreback commented Feb 7, 2018

ok fixed up.

@jorisvandenbossche jorisvandenbossche merged commit 6b0c7e7 into pandas-dev:master Feb 7, 2018
@jorisvandenbossche
Copy link
Member

I think that is our record of closed issues in one go :-)

Thanks @jreback !

@jorisvandenbossche
Copy link
Member

I opened some follow-up issues: #19570, #19571
And a PR with some more doc clean-ups: #19573

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment